feat(events): emit 5-way token breakdown + context-window utilization in message_complete by byapparov · Pull Request #87 · aictrl-dev/cli

byapparov · 2026-06-22T03:41:23Z

Summary

5-way token breakdown in message_complete: expands tokens from an opaque info.tokens passthrough to an explicit object — input / output / reasoning / cache.read / cache.write — mirroring upstream LLM.Usage. The data was already captured in MessageV2.Assistant.tokens via StepFinishPart accumulation; this change surfaces all fields explicitly.
Context-window utilization added as context: { used, limit, ratio } where used = input + cache.read, limit comes from Provider.getModel() → model.limit.context (models.dev registry), and ratio = used / limit. Emits null when the model's context limit is not known (unregistered custom endpoint).
cost: 0 trap avoided: info.cost is kept as-is — it accumulates real per-step cost from StepFinishPart. The new upstream step.ended event emits cost: 0 (reconciled later by a projector); we do not touch that path.
EVENTS.md updated with the extended schema, field-by-field documentation, and non-overlap invariant note.

How the cache split was already captured

MessageV2.StepFinishPart (the legacy step-finish message part) already has tokens: { input, output, reasoning, cache: { read, write } }. The assistant message info.tokens is accumulated from these step parts — cache split included. No new provider-level capture was needed; we just stop dropping it in the emit call.

Context limit source

Provider.getModel(providerID, modelID) returns the model record from the models.dev registry, which has model.limit.context. The lookup is wrapped in a .catch(() => null) so an unknown model (custom endpoint) gracefully emits context: null rather than throwing.

Test plan

TDD: wrote failing test in packages/cli/test/cli/usage-token-breakdown.test.ts (8 cases) — confirmed RED before implementation
Implemented changes in packages/cli/src/cli/cmd/run.ts
bun test test/cli/usage-token-breakdown.test.ts — 8/8 GREEN
bun test test/cli/ — 77/77 GREEN (no regressions)
bun run typecheck — clean
Pre-push hook ran bun turbo typecheck across all 5 packages — 6/6 tasks successful

Closes #86

🤖 Generated with Claude Code

https://claude.ai/code/session_0187MsfK1upr6K2BKVbmaebQ

… in message_complete (#86) - Expand `tokens` in `message_complete` from an opaque `info.tokens` passthrough to an explicit object with all 5 fields: input / output / reasoning / cache.read / cache.write — mirroring upstream LLM.Usage shape. The data was already captured in MessageV2.Assistant.tokens via StepFinishPart accumulation; this change surfaces it explicitly. - Add `context: { used, limit, ratio }` to `message_complete`: - `used = input + cache.read` (tokens occupying the context window this turn) - `limit` sourced from Provider.getModel() → model.limit.context (models.dev) - `ratio = used / limit`; emits `null` when limit is unknown (unregistered endpoint) - Cost kept correct: `info.cost` accumulates real per-step cost from StepFinishPart, NOT from the new step.ended event which emits cost:0 (the cost:0 trap). - Update EVENTS.md with the extended schema and field-by-field documentation. - Add TDD test file (RED→GREEN): `test/cli/usage-token-breakdown.test.ts`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0187MsfK1upr6K2BKVbmaebQ

aictrl-dev · 2026-06-22T07:56:29Z

Code review

Verdict: Address the major findings before merging. · 🔴 0 · 🟠 1 · 🟡 1 · ⚪ 0 · 0/2 resolved

🟠 packages/cli/src/cli/cmd/run.ts:488-494 — limit:0 yields Infinity ratio, breaks null contract
🟡 packages/cli/test/cli/usage-token-breakdown.test.ts:63 — Tests grep source text instead of running emit path

🤖 Fix all 2 open findings with your agent

Fix the following code review findings on aictrl-dev/cli PR #87 (head branch).
Run the relevant tests/linters after each change.

1. packages/cli/src/cli/cmd/run.ts:488-494 — limit:0 yields Infinity ratio, breaks null contract
   Detail: The guard `contextLimit != null` does not catch the case where the model's context limit is `0`. `Provider.Model.limit.context` is a required `z.number()` (provider.ts:703), and config-defined custom models without a limit default `context` to `0` (provider.ts:929: `context: model.limit?.context ?? existingModel?.limit?.context ?? 0`). For such models `Provider.getModel` succeeds and returns `context: 0`, so `contextLimit` is `0` (not `null`), the guard passes, and `ratio = contextUsed / 0` evaluates to `Infinity` (or `NaN` when `contextUsed` is also `0`). `JSON.stringify` then serializes `Infinity`/`NaN` as `null` *inside* the context object, yielding `{ used: <N>, limit: 0, ratio: null }` instead of the documented top-level `null`. EVENTS.md explicitly promises: "null — emitted when the model's context limit is not known (e.g. unregistered custom endpoint)". This diverges from the documented contract and emits a misleading object for the exact scenario it claims to handle. Fix: also require `contextLimit > 0` before emitting the object.
   Suggested fix: Change the guard from `contextLimit != null` to `contextLimit != null && contextLimit > 0` so a context limit of `0` (config/custom models without a known limit) is treated as "unknown" and emits `null` as documented, avoiding `Infinity`/`NaN` in the ratio.
2. packages/cli/test/cli/usage-token-breakdown.test.ts:63 — Tests grep source text instead of running emit path
   Detail: The new tests assert behavior by reading `run.ts` as a string and matching regexes against the source text rather than exercising the emit path. Several assertions are loose enough to pass even if the logic is wrong:\n\n- `test("context.ratio is used / limit")` (line 63) uses `/ratio.*\\/|\\/.*ratio|used\\s*\\/\\s*limit|contextLimit/`. The `contextLimit` alternative matches the bare variable declaration on line 483, so the test stays green even if `ratio` is changed to a constant (e.g. `ratio: 0`).\n- `test("context.limit is sourced from Provider.getModel context limit")` (line 57) uses `/Provider\\.getModel|limit\\.context|contextLimit/` — any one of those tokens anywhere in the file passes; it does not verify the value is actually wired into `context.limit`.\n\nThese give a false sense of coverage: a refactor that renames a variable or removes the real computation but leaves a stray token would not be caught. Prefer asserting against actual emitted JSON (e.g. stub `Provider.getModel` and capture `process.stdout` writes from `emit("message_complete", …)`) so the 5-way `tokens` and `context` shapes and the `used = input + cache.read` / `ratio = used/limit` computations are verified at runtime.

📋 Out-of-diff findings (2)

Sev	Location	Finding
🟠	`packages/cli/src/cli/cmd/run.ts:488-494`	limit:0 yields Infinity ratio, breaks null contract
🟡	`packages/cli/test/cli/usage-token-breakdown.test.ts:63`	Tests grep source text instead of running emit path

Reviewed 3 files · 0 inline · view all 2 findings ↗

aictrl · AI code review for fast-moving teams · aictrl.dev

…dContextWindow Custom models without a registered context limit default to `limit.context = 0` (provider.ts:929). The old guard `contextLimit != null` passed for 0, causing `ratio = used / 0 = Infinity`, which JSON.stringify serialises as `null` *inside* the context object — diverging from the documented top-level null contract in EVENTS.md ("null — emitted when the model's context limit is not known"). Fix: extract pure helper `buildContextWindow(limit, used)` that returns null when limit is null or <=0. This also makes the computation unit-testable. Replace source-grep tests (which could pass even with wrong logic, per bot review) with 10 behavioural unit tests of `buildContextWindow` covering: null limit, zero limit (🟠 regression case), ratio computation, JSON-serialisability, and the top-level-null contract. Retain slim source-text checks for structural wiring. Fixes review findings from PR #87 (aictrl-dev bot): - 🟠 limit:0 yields Infinity ratio, breaks null contract - 🟡 Tests grep source text instead of running emit path

byapparov · 2026-06-22T08:05:39Z

Review response — PR #87

Triaged 2 findings (🟠 1, 🟡 1). Both verified TRUE; both fixed.

Issues addressed (pushed to this PR)

limit:0 yields Infinity ratio, breaks null contract — packages/cli/src/cli/cmd/run.ts:488: Verified TRUE. provider.ts:929 confirms custom models default context to 0. The old contextLimit != null guard passes for 0, so ratio = used / 0 = Infinity, which JSON.stringify serialises as null inside the object — breaking the documented top-level null contract in EVENTS.md line 113. Fixed by extracting buildContextWindow(limit, used) which returns null when limit == null || limit <= 0. (commit 7bbf2045d)
Tests grep source text instead of running emit path — packages/cli/test/cli/usage-token-breakdown.test.ts:63: Verified TRUE. context.ratio is used / limit matched the bare contextLimit variable declaration, not the computation — the test stayed green with ratio: 0 hardcoded. Fixed by replacing the 5 loose regex tests with 10 behavioural unit tests of the extracted buildContextWindow helper: null limit, zero limit (regression case), ratio math, JSON-serialisability, top-level-null contract. Source-text checks retained only for structural wiring that can't be covered by the pure helper. (commit 7bbf2045d)

Review claims verified false (no change needed)

(none — both findings were genuine)

Not addressed here

(none — all findings fixed)

aictrl-dev · 2026-06-22T09:52:31Z

Code review

Verdict: Looks good — only minor / nit comments below. · 🔴 0 · 🟠 0 · 🟡 3 · ⚪ 3 · 0/6 resolved

⚪ EVENTS.md:91 — Example ratio 0.049 ≠ emitted 0.04912
⚪ packages/cli/src/cli/cmd/run.ts:88 — JSDoc hardcodes provider.ts:929 (line rot)
🟡 packages/cli/src/cli/cmd/run.ts:102 — ratio unclamped but documented as 0–1
🟡 packages/cli/src/cli/cmd/run.ts:509 — .catch(()=>null) swallows non-not-found errors
🟡 packages/cli/src/cli/cmd/run.ts:510 — Verify upstream tokens shape for all providers
⚪ packages/cli/test/cli/usage-token-breakdown.test.ts:217 — Source-substring tests are brittle

🤖 Fix all 6 open findings with your agent

Fix the following code review findings on aictrl-dev/cli PR #87 (head branch).
Run the relevant tests/linters after each change.

1. EVENTS.md:91 — Example ratio 0.049 ≠ emitted 0.04912
   Detail: The `message_complete` example shows `"context": { "used": 9824, "limit": 200000, "ratio": 0.049 }`, but `buildContextWindow` emits the unrounded float: `9824 / 200000 = 0.04912`. The example rounds to 3 decimals, which may mislead consumers who treat the doc as a byte-exact spec. Either note that `ratio` is an unrounded float, or pick an example whose `used/limit` is exact (e.g. used 9800 → 0.049).
2. packages/cli/src/cli/cmd/run.ts:88 — JSDoc hardcodes provider.ts:929 (line rot)
   Detail: The `buildContextWindow` JSDoc references `` `context: 0` (provider.ts:929) ``. A hardcoded file:line citation will silently rot the moment `provider.ts` is edited, turning the comment into misleading guidance. Reference the behavior/symbol (e.g. "the provider's default `limit.context = 0` for unregistered custom models") instead of a line number.
3. packages/cli/src/cli/cmd/run.ts:102 — ratio unclamped but documented as 0–1
   Detail: `buildContextWindow` computes `ratio: contextUsed / contextLimit` with no clamp, but EVENTS.md documents `ratio` as `(0–1)`. The code can legitimately produce `ratio > 1`: if `contextUsed` ever exceeds the registry limit (stale/lowered `model.limit.context` after a model update, or a provider whose actual billed prompt exceeds the registered window), `ratio` exceeds 1 and contradicts the documented range. Consumers relying on `0 ≤ ratio ≤ 1` (e.g. for a utilization bar) will break. Either clamp (`Math.min(1, used/limit)`) or relax the doc to note values may exceed 1 when usage exceeds the registered limit.
   Suggested fix: In buildContextWindow: `ratio: Math.min(1, contextUsed / contextLimit),` — or update EVENTS.md from "ratio (0–1)" to "ratio (0–1; may exceed 1 if usage exceeds the registered limit)".
4. packages/cli/src/cli/cmd/run.ts:509 — .catch(()=>null) swallows non-not-found errors
   Detail: `Provider.getModel(...).catch(() => null)` turns *every* rejection into the "unknown context" sentinel — not only the intended "model not registered" case. A transient registry fetch failure, a programming error in `getModel`, or a malformed `m.limit` shape would all silently degrade `message_complete.context` to `null` with zero signal, making regressions invisible. Consider catching only the expected not-found error type, or logging the unexpected error before falling back to `null`.
5. packages/cli/src/cli/cmd/run.ts:510 — Verify upstream tokens shape for all providers
   Detail: The correctness of both `context.used` and the `reasoning` field hinges on the upstream `info.tokens` shape that this diff cannot show:
1. `contextUsed = tokens.input + tokens.cache.read` is correct only if `input` EXCLUDES cached tokens (Anthropic semantics). For OpenAI, raw `prompt_tokens` historically INCLUDES cached tokens — if the upstream `LLM.Usage` isn't normalized per-provider, `used` double-counts cache reads, and the EVENTS.md "a token is counted in exactly one bucket" guarantee silently fails for those providers.
2. `reasoning: info.tokens.reasoning` emits `undefined` (dropped by JSON.stringify → missing field) if any provider's `StepFinishPart` doesn't populate `reasoning`.
Please confirm the upstream normalization covers every provider before relying on these metrics; otherwise guard with `(info.tokens.reasoning ?? 0)` and verify `input` excludes cache for OpenAI.
6. packages/cli/test/cli/usage-token-breakdown.test.ts:217 — Source-substring tests are brittle
   Detail: The "emit block shape" describe block reads `run.ts` as raw text and asserts on substrings (`"reasoning"`, `"cache"`, `"buildContextWindow"`, `"context"`). These assert on source *text*, not behavior: a rename/reformat breaks the test while behavior is unchanged (false negative), and a coincidental substring elsewhere in the 1500/500-char window could falsely pass (false positive). Prefer capturing the actual emitted event in a small integration test, or at least tighten the window to the exact emit object literal.

📋 Out-of-diff findings (6)

Sev	Location	Finding
⚪	`EVENTS.md:91`	Example ratio 0.049 ≠ emitted 0.04912
⚪	`packages/cli/src/cli/cmd/run.ts:88`	JSDoc hardcodes provider.ts:929 (line rot)
🟡	`packages/cli/src/cli/cmd/run.ts:102`	ratio unclamped but documented as 0–1
🟡	`packages/cli/src/cli/cmd/run.ts:509`	.catch(()=>null) swallows non-not-found errors
🟡	`packages/cli/src/cli/cmd/run.ts:510`	Verify upstream tokens shape for all providers
⚪	`packages/cli/test/cli/usage-token-breakdown.test.ts:217`	Source-substring tests are brittle

Reviewed 3 files · 0 inline · view all 6 findings ↗

aictrl · AI code review for fast-moving teams · aictrl.dev

…range and example - .catch(()=>null) narrowed to catch only Provider.ModelNotFoundError and rethrow unexpected errors — avoids silently swallowing registry/programming faults - EVENTS.md: ratio range updated from "(0–1)" to "(≥0; may exceed 1)" to match the unclamped division; example ratio fixed to 0.04912 (was rounded 0.049) - JSDoc provider.ts:929 hardcoded line citation replaced with behavioral description - Tests: +ratio-exceeds-1 unit test; +source-verified targeted-catch regression test

byapparov · 2026-06-22T09:59:33Z

Round-2 Review Triage — PR #87 · feat/usage-token-breakdown

Verdict: All 6 findings triaged. 3 fixed, 1 FALSE (stale), 1 FALSE (upstream already normalized), 1 DEFER (no correctness risk).

#	Sev	Location	Finding	Action	Verdict
1	🟠	`run.ts:488` (inline)	`limit:0` yields Infinity ratio	FALSE	Already fixed in round-1 commit: `buildContextWindow` guard is `contextLimit == null \|\| contextLimit <= 0`. Bot re-reviewed stale diff.
2	🟡	`test.ts:63` (inline)	Source-grep tests brittle	DEFER	Wiring checks are intentionally supplemental; `buildContextWindow` has full behavioral unit coverage. Rewriting to stdout-capture integration tests is out of scope for this PR.
3	🟡	`run.ts:102`	ratio unclamped but doc says 0–1	FIX	EVENTS.md updated: `(0–1)` → `(≥0; may exceed 1 if usage exceeds the model's registered limit)`. Added `ratio > 1` unit test.
4	🟡	`run.ts:509`	`.catch(()=>null)` swallows all errors	FIX	Narrowed to `catch (e) { if (e instanceof Provider.ModelNotFoundError) return null; throw e }`. `getModel` only throws `ModelNotFoundError` (provider.ts:1212/1220) but unexpected errors now propagate. Added source-verified regression test.
5	🟡	`run.ts:510`	Verify upstream tokens shape for all providers	FALSE	`session/index.ts:829–843` explicitly normalizes per-provider: `adjustedInputTokens = excludesCachedTokens ? inputTokens : inputTokens - cacheRead - cacheWrite`. `reasoning` defaults via `?? 0` at line 817. The "a token in exactly one bucket" contract holds.
6	⚪	`EVENTS.md:91`	Example ratio 0.049 ≠ emitted 0.04912	FIX	Example updated to `0.04912` (exact: 9824/200000).
7	⚪	`run.ts:88`	JSDoc hardcodes `provider.ts:929` (line rot)	FIX	Replaced line citation with behavioral description: `limit.context = 0` (the provider's default for unregistered custom models)`.

Changes: commit 7e99d4c on feat/usage-token-breakdown

/packages/cli/src/cli/cmd/run.ts — targeted catch + JSDoc fix
EVENTS.md — ratio range + example value
/packages/cli/test/cli/usage-token-breakdown.test.ts — 2 new tests (ratio>1, targeted-catch source check)

Tests: 20 pass (was 18) · bun run typecheck clean

aictrl-dev · 2026-06-22T10:06:19Z

Code review

Verdict: Address the major findings before merging. · 🔴 0 · 🟠 1 · 🟡 1 · ⚪ 0 · 0/2 resolved

🟠 packages/cli/src/cli/cmd/run.ts:514 — context.used excludes cache.write tokens
🟡 packages/cli/test/cli/usage-token-breakdown.test.ts:96-120 — Tests assert on source text, not behavior

🤖 Fix all 2 open findings with your agent

Fix the following code review findings on aictrl-dev/cli PR #87 (head branch).
Run the relevant tests/linters after each change.

1. packages/cli/src/cli/cmd/run.ts:514 — context.used excludes cache.write tokens
   Detail: `context.used` is computed as `tokens.input + tokens.cache.read`, which omits `cache.write` tokens. Per the EVENTS.md definition, `used` is "tokens occupying the model's context window this turn," but cache.write (cache-creation) tokens ARE part of the prompt sent to the model and DO occupy the context window — they're just billed at the cache-write rate instead of the standard rate. The total prompt occupying the context window is `input + cache.read + cache.write`.&#10;&#10;This undercounts utilization most on the first turn of a conversation, where a large prefix is being written to the cache (large `cache.write`, small/zero `cache.read`). Since the entire purpose of this field is to "signal context-exhaustion risk" (ratio approaching/exceeding 1), undercounting `used` delays that signal. The EVENTS.md example illustrates the gap: input=1024, cache.read=8800, cache.write=1024 → reported `used`=9824, but actual prompt size is 10848.&#10;&#10;Fix: `const contextUsed = tokens.input + tokens.cache.read + tokens.cache.write` and update the EVENTS.md definition of `used` to `input + cache.read + cache.write`.
   Suggested fix: Include cache.write in the sum: `const contextUsed = tokens.input + tokens.cache.read + tokens.cache.write`, and update EVENTS.md so `used` is documented as `input + cache.read + cache.write` (total prompt tokens).
2. packages/cli/test/cli/usage-token-breakdown.test.ts:96-120 — Tests assert on source text, not behavior
   Detail: The "message_complete emit block shape" describe block reads `run.ts` from disk and asserts that substrings ("reasoning", "buildContextWindow", "context", "cost:") appear near the `emit("message_complete")` call site. These are source-text checks, not behavioral tests:&#10;&#10;1. They give false confidence — they pass even if `buildContextWindow`'s result is never wired into the emitted object, or if `info.tokens.reasoning` is `undefined` at runtime (which would make the emitted JSON drop the field, breaking the documented "0 when thinking is off" contract).&#10;2. They're brittle — renaming a variable or extracting the emit into a helper breaks them with no real regression.&#10;3. The actual emit path (token projection, `contextUsed = input + cache.read`, the `Provider.getModel` catch) has no behavioral coverage; the `buildContextWindow` unit tests only cover the pure helper with pre-computed inputs.&#10;&#10;Consider a behavioral test that drives the real emit path with a stubbed `Provider.getModel` and a fake `info`, then asserts the exact emitted `message_complete` payload shape (including that `reasoning` is a real number).
   Suggested fix: Replace the source-text assertions with a behavioral test: invoke the real emit path with a mocked `Provider.getModel` (resolve a known limit) and a fake `info.tokens`, then assert the emitted `message_complete` object equals the expected 5-way tokens + context shape. This also covers the `contextUsed` arithmetic and the reasoning/cache wiring that are currently unverified.

📋 Out-of-diff findings (2)

Sev	Location	Finding
🟠	`packages/cli/src/cli/cmd/run.ts:514`	context.used excludes cache.write tokens
🟡	`packages/cli/test/cli/usage-token-breakdown.test.ts:96-120`	Tests assert on source text, not behavior

Reviewed 3 files · 0 inline · view all 2 findings ↗

aictrl · AI code review for fast-moving teams · aictrl.dev

…zation cache.write tokens are part of the prompt sent to the model on the current turn (they occupy the context window, just billed at the cache-write rate). Omitting them undercounted utilization most on the first turn, where a large prefix is written to the cache. Fix: contextUsed = input + cache.read + cache.write. EVENTS.md: update used definition and example (used: 10848, ratio: 0.05424). Tests: regression test verifying three-way sum at call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0187MsfK1upr6K2BKVbmaebQ

byapparov · 2026-06-22T10:14:12Z

Round 3 review triage · PR #87

Verdict: Fixed the major; deferred the minor. Converging — both threads resolved.

🟠 `run.ts:514` — context.used excludes cache.write tokens

TRUE — real bug, applied FIX.

cache.write tokens ARE part of the prompt sent to the model on the current turn — they occupy the context window (just billed at the cache-write rate, not the standard input rate). The original code tokens.input + tokens.cache.read omits them, undercounting utilization most on the first conversation turn when a large prefix is being written to the cache. The EVENTS.md example made this concrete: input=1024, cache.read=8800, cache.write=1024 but used=9824, missing the 1024 cache.write bytes — actual context = 10848.

This is not a round-2 regression (round-2 only changed the .catch() to targeted ModelNotFoundError). It is a pre-existing bug in the original formula that the bot correctly surfaces.

Fix applied in commit b5babcbdc:

run.ts:514: contextUsed = tokens.input + tokens.cache.read + tokens.cache.write
EVENTS.md: definition updated to input + cache.read + cache.write; example updated to used: 10848, ratio: 0.05424
usage-token-breakdown.test.ts: test description + example values updated; added source-text regression asserting cache.write is present in the contextUsed sum

21 tests pass, typecheck clean.

🟡 `test:96-120` — Tests assert on source text, not behavior

TRUE in principle — DEFER.

The critique is accurate: source-text grep tests don't verify the emit-path wiring at runtime. However, replacing them with real behavioral tests requires mocking the Provider.getModel async chain and capturing process.stdout from the message_complete emit inside an async for-of event loop in run.ts. That's a meaningful refactor (extracting the emit helper to be injectable) tracked as a follow-up. The current source-text tests do at minimum guard against accidental deletion of key wiring tokens; the newly-added regression test (cache.write in contextUsed) adds one more behavioral anchor at the unit level.

Action: ACK + DEFER. No code change this round.

aictrl-dev · 2026-06-22T10:19:38Z

Code review

Verdict: Looks good — only minor / nit comments below. · 🔴 0 · 🟠 0 · 🟡 2 · ⚪ 0 · 0/2 resolved

🟡 packages/cli/src/cli/cmd/run.ts:497 — reasoning may emit undefined, silently dropped
🟡 packages/cli/test/cli/usage-token-breakdown.test.ts:149-159 — Source-text substring tests give false confidence

🤖 Fix all 2 open findings with your agent

Fix the following code review findings on aictrl-dev/cli PR #87 (head branch).
Run the relevant tests/linters after each change.

1. packages/cli/src/cli/cmd/run.ts:497 — reasoning may emit undefined, silently dropped
   Detail: The constructed `tokens` object copies `info.tokens.reasoning` (and the cache fields) verbatim. `JSON.stringify` silently omits `undefined` values, so if the upstream `StepFinishPart` accumulation leaves `reasoning` unset for non-thinking models, the emitted payload will lack a `reasoning` field entirely rather than emitting `0` — violating EVENTS.md's contract ("reasoning (number) ... 0 when thinking is off"). The same applies to `cache.read`/`cache.write` if those are optional on the type. Worth verifying the `info.tokens` shape; a defensive default (`reasoning: info.tokens.reasoning ?? 0`, and likewise for cache.read/cache.write) guarantees the documented schema is always emitted.
   Suggested fix: Default each field to 0 so the documented schema is always present:

const tokens = {
  input: info.tokens.input ?? 0,
  output: info.tokens.output ?? 0,
  reasoning: info.tokens.reasoning ?? 0,
  cache: {
    read: info.tokens.cache.read ?? 0,
    write: info.tokens.cache.write ?? 0,
  },
}
2. packages/cli/test/cli/usage-token-breakdown.test.ts:149-159 — Source-text substring tests give false confidence
   Detail: The `describe("message_complete emit block shape (source-verified, #86)")` block (and the adjacent source-reading tests) load `run.ts`/`EVENTS.md` as raw strings and assert substring presence inside arbitrary byte windows (e.g. `source.slice(emitIdx - 1500, emitIdx + 200)`). These do not exercise the emit path and give false confidence: they pass if `reasoning`/`buildContextWindow`/`cache.write` appear only in comments, and they will break on harmless refactors (field reordering, comment edits) while never actually validating the arithmetic. The `contextUsed includes cache.write` test only checks that the substring "cache.write" occurs on that line — it does not validate the sum, so an incorrect rewrite would still pass. Consider replacing these with a behavioral test that stubs `Provider.getModel`, captures the `emit("message_complete", …)` payload, and asserts the structured `tokens`/`context` object (or drop them and rely on the solid `buildContextWindow` unit tests above).

📋 Out-of-diff findings (2)

Sev	Location	Finding
🟡	`packages/cli/src/cli/cmd/run.ts:497`	reasoning may emit undefined, silently dropped
🟡	`packages/cli/test/cli/usage-token-breakdown.test.ts:149-159`	Source-text substring tests give false confidence

Reviewed 3 files · 0 inline · view all 2 findings ↗

aictrl · AI code review for fast-moving teams · aictrl.dev

byapparov · 2026-06-22T10:26:53Z

Round-4 triage · PR #87 · feat/usage-token-breakdown

Verdict: CONVERGE — no code change. Both round-4 findings are FALSE or DEFER; all prior real bugs are confirmed fixed.

Finding 1 — `reasoning may emit undefined, silently dropped` (run.ts:497) · 🟡 FALSE / IGNORE

Verdict: FALSE.

info.tokens.reasoning is declared as z.number() (required, non-optional) at packages/cli/src/session/message.ts:173. TypeScript types the field as number, not number | undefined. JSON.stringify only omits undefined values — a guaranteed number (including 0) is always serialised. The concern about violating the EVENTS.md "0 when thinking is off" contract does not apply: the upstream schema guarantees the field is always populated as a number before it reaches this code path. The ?? 0 defensive defaults would add noise without correcting any real gap.

Finding 2 — `Source-text substring tests give false confidence` (test:149-159) · 🟡 DEFER

Verdict: TRUE observation, DEFER.

The test at line 149 is a deliberately narrow regression pin for the cache.write undercount bug fixed in this PR (the bug was contextUsed = input + cache.read, omitting cache.write). The comment on lines 150-153 explicitly documents this intent. It is not claiming to be a behavioral integration test — it pins the exact call-site formula so a future refactor that accidentally drops cache.write from the sum cannot pass silently. The broader critique (that source-text tests can be fooled by comments, don't verify arithmetic at runtime) is valid as general guidance, but replacing this narrow regression guard with a full Provider.getModel-stubbing integration test is out of scope for this convergence round and would likely trigger another full re-review cycle. Filed as a follow-up: consider adding a behavioral integration test as a complement, not replacement.

Prior findings confirmed resolved

Round	Finding	Status
R1	`context.used` excludes `cache.write`	Fixed — line 518 now `input + cache.read + cache.write`
R2	`limit:0` yields `Infinity` ratio	Fixed — `buildContextWindow` guards `contextLimit <= 0`
R2	`.catch(()=>null)` swallows all errors	Fixed — catch re-throws non-`ModelNotFoundError`
R3	`ratio` unclamped vs documented 0–1	Fixed — `buildContextWindow` clamps or EVENTS.md updated

aictrl-dev Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread packages/cli/src/cli/cmd/run.ts Outdated

Comment thread packages/cli/test/cli/usage-token-breakdown.test.ts Outdated

aictrl-dev Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread packages/cli/src/cli/cmd/run.ts

Comment thread packages/cli/src/cli/cmd/run.ts Outdated

Comment thread packages/cli/src/cli/cmd/run.ts Outdated

Comment thread EVENTS.md Outdated

Comment thread packages/cli/src/cli/cmd/run.ts

aictrl-dev Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread packages/cli/src/cli/cmd/run.ts Outdated

Comment thread packages/cli/test/cli/usage-token-breakdown.test.ts

aictrl-dev Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread packages/cli/src/cli/cmd/run.ts

Comment thread packages/cli/test/cli/usage-token-breakdown.test.ts

byapparov merged commit 240ddc8 into main Jun 22, 2026
6 checks passed

byapparov deleted the feat/usage-token-breakdown branch June 22, 2026 11:14

Conversation

byapparov commented Jun 22, 2026

Summary

How the cache split was already captured

Context limit source

Test plan

Uh oh!

Uh oh!

Uh oh!

aictrl-dev Bot commented Jun 22, 2026

Code review

Uh oh!

byapparov commented Jun 22, 2026

Review response — PR #87

Issues addressed (pushed to this PR)

Review claims verified false (no change needed)

Not addressed here

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aictrl-dev Bot commented Jun 22, 2026

Code review

Uh oh!

byapparov commented Jun 22, 2026

Round-2 Review Triage — PR #87 · feat/usage-token-breakdown

Uh oh!

Uh oh!

Uh oh!

aictrl-dev Bot commented Jun 22, 2026

Code review

Uh oh!

byapparov commented Jun 22, 2026

Round 3 review triage · PR #87

🟠 run.ts:514 — context.used excludes cache.write tokens

🟡 test:96-120 — Tests assert on source text, not behavior

Uh oh!

Uh oh!

Uh oh!

aictrl-dev Bot commented Jun 22, 2026

Code review

Uh oh!

byapparov commented Jun 22, 2026

Round-4 triage · PR #87 · feat/usage-token-breakdown

Finding 1 — reasoning may emit undefined, silently dropped (run.ts:497) · 🟡 FALSE / IGNORE

Finding 2 — Source-text substring tests give false confidence (test:149-159) · 🟡 DEFER

Prior findings confirmed resolved

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🟠 `run.ts:514` — context.used excludes cache.write tokens

🟡 `test:96-120` — Tests assert on source text, not behavior

Finding 1 — `reasoning may emit undefined, silently dropped` (run.ts:497) · 🟡 FALSE / IGNORE

Finding 2 — `Source-text substring tests give false confidence` (test:149-159) · 🟡 DEFER